Red Wine Exploration by Sam Chen

Overview: This data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Key question: “What chemical properties are most important in terms of predicting the quality of wine?”

Univariate Plots Section

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Our data consists of 13 variables, with 1,599 observations. Of these, 11 variables are chemical properties of the wine; the other two are X, which is just a row index, and quality, the rating.
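The structure output above can be produced with a few base R calls. The following is a minimal sketch on a tiny made-up data frame; the name rw_demo and all of its values are assumptions for illustration, not the real wine data.

```r
# Tiny made-up stand-in for the wine data (values are illustrative only).
rw_demo <- data.frame(
  X = 1:4,                                      # row index, not a wine property
  alcohol = c(9.4, 9.8, 10.5, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.50, 0.28),
  quality = c(5L, 5L, 6L, 7L)
)

print(dim(rw_demo))      # number of rows, then number of columns
str(rw_demo)             # type and first values of every column
print(summary(rw_demo))  # min, quartiles, mean and max of every column

# Drop the index and the rating to count only the chemical properties.
chem_demo <- rw_demo[, setdiff(names(rw_demo), c("X", "quality"))]
print(ncol(chem_demo))
```

On the real data the same subtraction gives 13 - 2 = 11 chemical properties.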

As the key question is which chemical properties matter most for predicting the quality of red wine, I would like to first look at the distribution plots for all 11 chemical properties to see if anything catches my eye.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

First, let’s take a look at the distribution of quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

All the red wine quality ratings range from 3 to 8. No wine was rated very bad (0-2) or excellent (9-10). Moreover, most of the wines are rated 5 or 6. The mean of the quality ratings is 5.636 and the median is 6.000.
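One way to make the unused ends of the 0-10 scale visible is to tabulate the ratings against every possible level. This is a small sketch on made-up ratings (quality_demo is a hypothetical vector, not the real column):

```r
# Made-up ratings that, like the real data, only span 3 to 8.
quality_demo <- c(3L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 8L)

# Turning the ratings into a factor with all 11 levels keeps empty levels in the table.
counts <- table(factor(quality_demo, levels = 0:10))
print(counts)

# Ratings on the scale that no wine received.
unused <- names(counts)[as.vector(counts) == 0]
print(unused)

print(summary(quality_demo))
```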

Now let’s take a closer look at the 11 chemical properties within the data set.

  1. fixed.acidity(original)

The histogram of fixed.acidity shows that fixed.acidity is slightly skewed to the right. Let’s see the log plot.

  1. fixed.acidity(log)

The log plot is much closer to the bell curve of a normal distribution. Let’s see the summary:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The mean of fixed.acidity is 8.32 and the median is 7.90. The middle half of the red wines have fixed.acidity between 7.10 and 9.20.
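The effect of the log transform on a right-skewed variable can be checked numerically as well as by eye. The sketch below uses simulated right-skewed values, not the wine data, and a hand-written moment-based skewness helper (skew is a hypothetical name; base R has no built-in skewness function).

```r
set.seed(42)

# Simulated right-skewed values standing in for a variable such as fixed.acidity.
x <- rlnorm(1000, meanlog = 2, sdlog = 0.5)

# Simple moment-based skewness: about 0 for a symmetric distribution,
# positive when the right tail is longer.
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skew(x)
skew_log <- skew(log10(x))
print(c(raw = skew_raw, log10 = skew_log))

# Bin counts of both histograms, computed without opening a plot device.
print(hist(x, plot = FALSE)$counts)
print(hist(log10(x), plot = FALSE)$counts)
```

A skewness much closer to zero after the transform is the numeric counterpart of the histogram looking more bell-shaped.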

  1. volatile.acidity

The histogram of volatile.acidity shows that volatile.acidity is slightly right-skewed. Let’s see the summary of volatile.acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Most of the red wines have volatile.acidity from about 0.40 to 0.65.

  1. citric.acid

The histogram of citric.acid shows that the most common value is citric.acid = 0. This may need further exploration. Let’s see the summary for citric.acid.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The smallest value for citric.acid is 0.000 and the largest value for citric.acid is 1.000. However, 75% of the values are below 0.420. Let’s see the square root distribution.

  1. citric.acid(sqrt)

In the first plot, 0 is the value with the largest count, so after taking the square root the most frequent value is still 0. Because almost all of the other values lie between 0 and 1, taking the square root makes them larger. As a result, there is a blank spot in the plot just above 0. I wonder how citric.acid is connected to quality, and whether particular citric.acid values are specific to certain alcohol levels.
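The blank spot can be reproduced with a few numbers. This sketch uses a made-up vector (citric_demo) and assumes the smallest nonzero recorded value is 0.01, which is a guess based on the two-decimal precision of the values shown earlier.

```r
# Made-up citric.acid values, including exact zeros and the assumed smallest step 0.01.
citric_demo <- c(0, 0, 0.01, 0.04, 0.09, 0.26, 0.42, 1)
sqrt_demo <- sqrt(citric_demo)

print(data.frame(original = citric_demo, square_root = round(sqrt_demo, 2)))

# Zeros stay at zero, values strictly between 0 and 1 get larger, and 1 stays at 1.
inside <- citric_demo > 0 & citric_demo < 1
print(all(sqrt_demo[inside] > citric_demo[inside]))

# The smallest nonzero value jumps from 0.01 to 0.1, so nothing falls between 0 and 0.1.
print(min(sqrt_demo[citric_demo > 0]))
```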

  1. residual.sugar

From the histogram of residual.sugar, we can see that it’s very long-tailed. Let’s see the log plot.

  1. residual.sugar(log)

We can see from the log plot that the distribution of residual.sugar is closer to a normal distribution. Let’s see the summary of residual.sugar:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The middle half of the red wines have residual.sugar values between 1.9 and 2.6.

  1. chlorides

From the histogram of chlorides, we can see that it’s very long-tailed. Let’s see the log plot.

  1. chlorides(log)

The log plot is much closer to the bell curve of a normal distribution. Let’s see the summary for chlorides:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The middle half of the red wines have chlorides between 0.07 and 0.09, and 75% of the red wines have chlorides at or below 0.09.

  1. free.sulfur.dioxide

The histogram of free.sulfur.dioxide is right-skewed. Let’s see the log plot.

  1. free.sulfur.dioxide(log)

Let’s see the summary for free.sulfur.dioxide:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The largest value of free.sulfur.dioxide in a red wine is 72.00, which is far away from the 3rd quartile (21.00) and the median (14.00).
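To put a number on "far away", one common convention is Tukey's boxplot rule, which flags values above the 3rd quartile plus 1.5 times the interquartile range. This rule is not part of the original analysis; the sketch simply applies it to the summary values printed above.

```r
# Quartiles and maximum of free.sulfur.dioxide, copied from the summary above.
q1 <- 7
q3 <- 21
max_value <- 72

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # Tukey's upper fence (a convention, not a law)

print(upper_fence)
print(max_value > upper_fence)  # TRUE means the maximum would be drawn as an outlier
```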

  1. total.sulfur.dioxide

The histogram of total.sulfur.dioxide is long-tailed. Let’s see the log plot.

  1. total.sulfur.dioxide(log)

The log plot for total.sulfur.dioxide is close to the bell curve. Let’s see the summary of total.sulfur.dioxide:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

For total.sulfur.dioxide, the mean is 46.47 and the median is 38.00. 75% of the total.sulfur.dioxide values are at or below 62.00, while the max value is 289.00.

  1. density

The histogram of density is close to a normal distribution. Let’s see the summary of density:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Combining the histogram and the summary, we can see that density is close to a normal distribution, ranging from 0.9901 to 1.0037. The median (0.9968) and mean (0.9967) are almost the same!

  1. pH

The histogram of pH is close to a normal distribution. Let’s see the summary of pH:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The max value of pH is 4.010, which indicates that all the red wines are acidic (pH < 7). Combining the histogram and the summary, we can see that pH is close to a normal distribution, ranging from 2.740 to 4.010. The median (3.310) and mean (3.311) are almost the same!

  1. sulphates

It seems that the distribution for sulphates is right-skewed. Let’s see the log plot.

  1. sulphates(log)

The log plot for sulphates is closer to the bell curve. Let’s see the summary of sulphates:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The middle half of the red wines have sulphates between 0.55 and 0.73, while the max value is 2.00.

  1. alcohol

The distribution for the alcohol shown in this histogram is right-skewed. Let’s see the log plot.

  1. alcohol(log)

Let’s see the summary for alcohol:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol values for the red wines range from 8.40 to 14.90. I wonder if alcohol affects the quality.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the data with 11 chemical properties(fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol).

Other Observations:
  • Most of the wines are rated 5 or 6.
  • Chemical properties that are right-skewed: fixed.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol.
  • Most of the red wines have volatile.acidity from about 0.40 to 0.65.
  • residual.sugar and chlorides are heavily long-tailed.
  • Chemical properties that are close to a normal distribution: pH, density.
  • The middle half of the red wines have chlorides between 0.07 and 0.09.
  • The middle half of the red wines have residual.sugar values between 1.9 and 2.6.
  • 75% of the red wines have chlorides at or below 0.09, while the max value is 0.61.
  • The max value of pH is 4.010, which indicates that all the red wines are acidic (pH < 7).
  • The middle half of the red wines have sulphates between 0.55 and 0.73, while the max value is 2.00.
  • The alcohol values for the red wines range from 8.40 to 14.90.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the data set are quality and alcohol. I’d like to determine which features are best for predicting the quality of red wine. I suspect volatile.acidity and some combination of the other variables can be used to build a predictive model of red wine quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

volatile.acidity, residual.sugar, chlorides, density, sulphates and alcohol likely contribute to the quality of red wine. After reading up on red wine quality, I think volatile.acidity and alcohol probably contribute the most.

Did you create any new variables from existing variables in the dataset?

Not at the moment.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I log-transformed the right-skewed distributions (fixed.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol).

Bivariate Plots Section

##                           X fixed.acidity volatile.acidity citric.acid
## X                     1.000        -0.268           -0.009      -0.154
## fixed.acidity        -0.268         1.000           -0.256       0.672
## volatile.acidity     -0.009        -0.256            1.000      -0.552
## citric.acid          -0.154         0.672           -0.552       1.000
## residual.sugar       -0.031         0.115            0.002       0.144
## chlorides            -0.120         0.094            0.061       0.204
## free.sulfur.dioxide   0.090        -0.154           -0.011      -0.061
## total.sulfur.dioxide -0.118        -0.113            0.076       0.036
## density              -0.368         0.668            0.022       0.365
## pH                    0.136        -0.683            0.235      -0.542
## sulphates            -0.125         0.183           -0.261       0.313
## alcohol               0.245        -0.062           -0.202       0.110
## quality               0.066         0.124           -0.391       0.226
##                      residual.sugar chlorides free.sulfur.dioxide
## X                            -0.031    -0.120               0.090
## fixed.acidity                 0.115     0.094              -0.154
## volatile.acidity              0.002     0.061              -0.011
## citric.acid                   0.144     0.204              -0.061
## residual.sugar                1.000     0.056               0.187
## chlorides                     0.056     1.000               0.006
## free.sulfur.dioxide           0.187     0.006               1.000
## total.sulfur.dioxide          0.203     0.047               0.668
## density                       0.355     0.201              -0.022
## pH                           -0.086    -0.265               0.070
## sulphates                     0.006     0.371               0.052
## alcohol                       0.042    -0.221              -0.069
## quality                       0.014    -0.129              -0.051
##                      total.sulfur.dioxide density     pH sulphates alcohol
## X                                  -0.118  -0.368  0.136    -0.125   0.245
## fixed.acidity                      -0.113   0.668 -0.683     0.183  -0.062
## volatile.acidity                    0.076   0.022  0.235    -0.261  -0.202
## citric.acid                         0.036   0.365 -0.542     0.313   0.110
## residual.sugar                      0.203   0.355 -0.086     0.006   0.042
## chlorides                           0.047   0.201 -0.265     0.371  -0.221
## free.sulfur.dioxide                 0.668  -0.022  0.070     0.052  -0.069
## total.sulfur.dioxide                1.000   0.071 -0.066     0.043  -0.206
## density                             0.071   1.000 -0.342     0.149  -0.496
## pH                                 -0.066  -0.342  1.000    -0.197   0.206
## sulphates                           0.043   0.149 -0.197     1.000   0.094
## alcohol                            -0.206  -0.496  0.206     0.094   1.000
## quality                            -0.185  -0.175 -0.058     0.251   0.476
##                      quality
## X                      0.066
## fixed.acidity          0.124
## volatile.acidity      -0.391
## citric.acid            0.226
## residual.sugar         0.014
## chlorides             -0.129
## free.sulfur.dioxide   -0.051
## total.sulfur.dioxide  -0.185
## density               -0.175
## pH                    -0.058
## sulphates              0.251
## alcohol                0.476
## quality                1.000

From a subset of the data, sulphates, citric.acid and total.sulfur.dioxide do not seem to have strong correlations with quality (0.251, 0.226 and -0.185 in the matrix above), but density is moderately correlated with alcohol (-0.496). I want to look closer at scatter plots involving quality and some other variables like alcohol, volatile.acidity, and density.
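A rounded correlation matrix like the one above can be built with cor(). The sketch below runs on simulated columns; demo and its three variables are made up, and the coefficients used to generate them are arbitrary.

```r
set.seed(1)
n <- 200

# Simulated stand-ins: density falls with alcohol, quality rises with it.
alcohol <- rnorm(n, mean = 10.4, sd = 1)
density <- 1.002 - 0.0005 * alcohol + rnorm(n, sd = 0.001)
quality <- round(2 + 0.35 * alcohol + rnorm(n, sd = 0.7))
demo <- data.frame(alcohol, density, quality)

# Pearson correlations between every pair of columns, rounded to 3 decimals.
cor_demo <- round(cor(demo), 3)
print(cor_demo)

# Rank the predictors by the strength of their correlation with quality.
with_quality <- cor_demo[rownames(cor_demo) != "quality", "quality"]
print(sort(abs(with_quality), decreasing = TRUE))
```

Sorting by absolute value puts strong negative correlations, such as volatile.acidity in the real matrix, next to strong positive ones.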

Let’s see the scatter plot for quality and alcohol:

It seems that we cannot clearly see the pattern from the plot. Let’s transform the data and try adding some jitter:

From the plot, we can see that the points in the scatter plot drift from the bottom-left corner to the top-right, which indicates that red wine quality is positively correlated with alcohol in the data. Most red wines in the data have an alcohol percentage between 9 and 12.
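Jitter helps here because quality only takes integer values, so many points sit exactly on top of each other. A minimal base R sketch follows, using made-up vectors (alcohol_demo and quality_demo are hypothetical) and an arbitrary amount of 0.3; the original plots may have been drawn differently.

```r
set.seed(7)

# Made-up points; several wines share the same integer quality.
alcohol_demo <- c(9.4, 9.8, 9.8, 10.5, 11.2, 12.0)
quality_demo <- c(5, 5, 5, 6, 6, 7)

# jitter() adds uniform noise of at most `amount` in either direction.
quality_jit <- jitter(quality_demo, amount = 0.3)
print(round(quality_jit, 2))

# Semi-transparent black, so stacked points show up as darker areas.
point_col <- rgb(0, 0, 0, alpha = 0.3)

# Only draw when running interactively, so the script also works in batch mode.
if (interactive()) {
  plot(alcohol_demo, quality_jit, col = point_col, pch = 16,
       xlab = "alcohol", ylab = "quality (jittered)")
}
```

Keeping the amount below 0.5 means a jittered point never crosses into the band of the neighbouring rating.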

## 
## Call:
## lm(formula = quality ~ alcohol, data = subset(rw, alcohol <= 
##     quantile(rw$alcohol, 0.999)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8489 -0.4065 -0.1787  0.5176  2.5909 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.81782    0.17512   10.38   <2e-16 ***
## alcohol      0.36646    0.01672   21.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7083 on 1596 degrees of freedom
## Multiple R-squared:  0.2314, Adjusted R-squared:  0.2309 
## F-statistic: 480.4 on 1 and 1596 DF,  p-value: < 2.2e-16

From the summary, based on the R^2 value, alcohol explains 23.14% of the variance in red wine quality.
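The model call printed above trims the top 0.1% of alcohol values before fitting. The same pattern is sketched below on simulated data; rw_demo and the coefficients used to generate it are made up, so the numbers will not match the real output.

```r
set.seed(3)
n <- 500

# Simulated stand-in for the wine data.
rw_demo <- data.frame(alcohol = runif(n, min = 8.4, max = 14.9))
rw_demo$quality <- round(1.8 + 0.37 * rw_demo$alcohol + rnorm(n, sd = 0.7))

# Keep only rows at or below the 99.9th percentile of alcohol, as in the call above.
trimmed <- subset(rw_demo, alcohol <= quantile(rw_demo$alcohol, 0.999))

fit <- lm(quality ~ alcohol, data = trimmed)
r2 <- summary(fit)$r.squared
print(coef(fit))
print(r2)

# With a single predictor, R^2 is the squared Pearson correlation.
r <- cor(trimmed$alcohol, trimmed$quality)
print(all.equal(r2, r^2))
```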

Let’s see the scatter plot for quality and volatile.acidity:

The data suffers from overplotting. Let’s add some jitter and alpha, and transform the data using log.

After applying jitter, alpha and the log transform, we can see the points in the plot move from the top-left to the bottom-right, which indicates that volatile.acidity is negatively correlated with red wine quality. Most red wines have volatile.acidity between 0.25 and 0.75. We can tell that red wines with higher volatile.acidity tend to have lower quality, while red wines with lower volatile.acidity tend to have better quality.

## 
## Call:
## lm(formula = quality ~ volatile.acidity, data = subset(rw, volatile.acidity <= 
##     quantile(rw$volatile.acidity, 0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.78977 -0.54547 -0.01325  0.47198  2.92568 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.55757    0.05841  112.27   <2e-16 ***
## volatile.acidity -1.74500    0.10503  -16.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7436 on 1596 degrees of freedom
## Multiple R-squared:  0.1474, Adjusted R-squared:  0.1469 
## F-statistic:   276 on 1 and 1596 DF,  p-value: < 2.2e-16

From the summary, based on the R^2 value, volatile.acidity explains only 14.74% of the variance in red wine quality.

Finally, let’s see the scatter plot between quality and density:

The data suffers from overplotting. Let’s add some jitter and alpha, and transform the data using log.

## 
## Call:
## lm(formula = quality ~ density, data = subset(rw, density <= 
##     quantile(rw$density, 0.999)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7918 -0.6200  0.1504  0.4262  2.5233 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    82.43      10.60   7.779 1.31e-14 ***
## density       -77.04      10.63  -7.247 6.62e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7952 on 1595 degrees of freedom
## Multiple R-squared:  0.03188,    Adjusted R-squared:  0.03127 
## F-statistic: 52.52 on 1 and 1595 DF,  p-value: 6.616e-13

Based on the plot and the summary, most red wines have density between 0.9950 and 0.9975. Quality and density are only weakly correlated (R^2 = 0.032).

Next, I’ll look at the scatter plots of the other chemical features against red wine quality (sulphates, citric.acid, total.sulfur.dioxide).

First, let’s see the scatter plot for quality and sulphates:

The data suffers from overplotting. Let’s add some jitter and alpha, and transform the data using log.

There seems to be some correlation between quality and sulphates. Let’s dive into the linear model summary to learn more.

## 
## Call:
## lm(formula = quality ~ sulphates, data = subset(rw, sulphates <= 
##     quantile(rw$sulphates, 0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91625 -0.53267  0.07005  0.45363  2.39883 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.73813    0.08061   58.78   <2e-16 ***
## sulphates    1.36990    0.11917   11.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7756 on 1595 degrees of freedom
## Multiple R-squared:  0.07651,    Adjusted R-squared:  0.07593 
## F-statistic: 132.1 on 1 and 1595 DF,  p-value: < 2.2e-16

Based on the R^2 value, sulphates explains only 7.651% of the variance in red wine quality!

Now, let’s see the scatter plot for quality and citric.acid:

The data suffers from overplotting. Let’s add some jitter and alpha, and transform the data using log.

## 
## Call:
## lm(formula = quality ~ citric.acid, data = subset(rw, citric.acid <= 
##     quantile(rw$citric.acid, 0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.01809 -0.59820  0.09909  0.50922  2.59711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.37360    0.03371 159.384   <2e-16 ***
## citric.acid  0.97651    0.10144   9.627   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7847 on 1595 degrees of freedom
## Multiple R-squared:  0.05491,    Adjusted R-squared:  0.05432 
## F-statistic: 92.68 on 1 and 1595 DF,  p-value: < 2.2e-16

Based on the plot and the summary, the horizontal strips in the plot and the R^2 value (0.055) indicate that quality and citric.acid are only weakly correlated.

Let’s see the scatter plot for quality and total.sulfur.dioxide:

The data suffers from overplotting. Let’s add some jitter and alpha, and transform the data using log.

## 
## Call:
## lm(formula = quality ~ total.sulfur.dioxide, data = subset(rw, 
##     total.sulfur.dioxide <= quantile(rw$total.sulfur.dioxide, 
##         0.999)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8299 -0.6300  0.1964  0.3858  2.5857 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.8772219  0.0348085 168.845   <2e-16 ***
## total.sulfur.dioxide -0.0052610  0.0006208  -8.475   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7893 on 1595 degrees of freedom
## Multiple R-squared:  0.04309,    Adjusted R-squared:  0.04249 
## F-statistic: 71.82 on 1 and 1595 DF,  p-value: < 2.2e-16

Based on the plot and the summary, the horizontal strips in the plot and the R^2 value (0.043) indicate that quality and total.sulfur.dioxide are only weakly correlated.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Red wine quality correlates with alcohol (R^2 = 23.14%) and volatile.acidity (R^2 = 14.74%), and slightly with sulphates (R^2 = 7.65%).

As alcohol increases, the variance in quality increases. In the plot of quality versus alcohol, there are horizontal bands where many red wines take on the same quality value at different alcohol points. Based on the R^2 value, alcohol explains only about 23 percent of the variance in quality. Other features of interest can be incorporated into the model to explain more of the variance in quality.

Red wines with higher volatile.acidity tend to have lower quality while red wines with lower volatile.acidity tend to have better quality.

We found that quality is only weakly correlated with citric.acid, total.sulfur.dioxide and density.
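As a consistency check, the R^2 of a one-predictor linear model equals the squared correlation coefficient, so the R^2 values can be compared with the correlation matrix. The sketch below squares the correlations with quality copied from that matrix.

```r
# Correlations with quality, copied from the correlation matrix above.
r_quality <- c(alcohol = 0.476, volatile.acidity = -0.391, sulphates = 0.251)

# Squared correlations, i.e. the share of variance a one-predictor model would explain.
r_squared <- round(r_quality^2, 3)
print(r_squared)
```

This gives roughly 0.227, 0.153 and 0.063. The value for alcohol matches the R-squared of model m1 on the full data; the single-variable models report slightly different values (0.2314, 0.1474, 0.0765), most likely because each was fitted after dropping the top 0.1% of the predictor.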

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  • fixed.acidity appears to have a strong correlation with citric.acid (0.672); this may be because both are acidity-related chemical features.
  • free.sulfur.dioxide appears to have a strong correlation with total.sulfur.dioxide (0.668); this may be because both measure sulfur dioxide.
  • fixed.acidity appears to have a strong negative correlation with pH (-0.683); this lines up with the definition of pH, where more acid means a lower pH.

What was the strongest relationship you found?

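Among the relationships with quality, the strongest is the positive correlation with alcohol (0.476), followed by the negative correlation with volatile.acidity (-0.391) and a weaker positive correlation with sulphates (0.251). Across all pairs of variables, the strongest is the negative correlation between fixed.acidity and pH (-0.683).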

Multivariate Plots Section

Let’s plot the relationship between quality and alcohol, using volatile.acidity as the color parameter.

I noticed that there are a few blank regions in the plot. This may be due to the fact that the quality values are integers. The graph presents a linear relationship between quality and alcohol.

Let’s see the plot for alcohol and quality using sulphates in the color parameter.

The results from the log transformation show that the darker points are on the bottom-left while the lighter points are on the top-right.

Let’s plot the relationship of quality and volatile.acidity using sulphates in color:

From the plot we can see that quality and volatile.acidity are negatively related.

Next, let’s fit the models:

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = rw)
## m2: lm(formula = quality ~ alcohol + sulphates, data = rw)
## m3: lm(formula = quality ~ alcohol + sulphates + volatile.acidity, 
##     data = rw)
## 
## ==============================================================
##                          m1            m2            m3       
## --------------------------------------------------------------
##   (Intercept)           1.875***      1.375***      2.611***  
##                        (0.175)       (0.177)       (0.196)    
##   alcohol               0.361***      0.346***      0.309***  
##                        (0.017)       (0.016)       (0.016)    
##   sulphates                           0.994***      0.679***  
##                                      (0.102)       (0.101)    
##   volatile.acidity                                 -1.221***  
##                                                    (0.097)    
## --------------------------------------------------------------
##   R-squared             0.227         0.270         0.336     
##   adj. R-squared        0.226         0.269         0.335     
##   sigma                 0.710         0.690         0.659     
##   F                   468.267       294.988       268.912     
##   p                     0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1675.142     -1599.384     
##   Deviance            805.870       760.894       692.105     
##   AIC                3448.114      3358.284      3208.768     
##   BIC                3464.245      3379.793      3235.654     
##   N                  1599          1599          1599         
## ==============================================================

From the table, we can see that even the largest model (m3) explains only 33.6% of the variance in quality.
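The three nested models in the table can be fitted and compared with base R alone. The sketch below uses simulated data; the data frame d is made up, and the generating coefficients are borrowed from column m3 only to keep the simulation plausible. The comparison table above was produced by a different function than the ones used here.

```r
set.seed(11)
n <- 400

# Simulated predictors and a quality score built from them plus noise.
d <- data.frame(
  alcohol = rnorm(n, mean = 10.4, sd = 1.1),
  sulphates = rnorm(n, mean = 0.66, sd = 0.17),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
d$quality <- 2.6 + 0.31 * d$alcohol + 0.68 * d$sulphates -
  1.22 * d$volatile.acidity + rnorm(n, sd = 0.66)

# Nested models: each one adds a predictor to the previous one.
m1 <- lm(quality ~ alcohol, data = d)
m2 <- lm(quality ~ alcohol + sulphates, data = d)
m3 <- lm(quality ~ alcohol + sulphates + volatile.acidity, data = d)

# R^2 can only stay the same or increase as predictors are added.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
print(round(r2, 3))

# F-tests for whether each added predictor improves the fit.
print(anova(m1, m2, m3))
```

Because R^2 never drops when a predictor is added, adjusted R^2, AIC or BIC (all listed in the table) are safer guides to whether the extra variable is worth keeping.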

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

  • Using color on the plot of quality against alcohol helps explain the relationship between them.
  • Using color on the plot of quality against volatile.acidity helps explain the relationship between them.
  • There’s a positive relationship between quality and alcohol.
  • There’s a negative relationship between quality and volatile.acidity.

Were there any interesting or surprising interactions between features?

The sulphates coloring in the plot of quality by volatile.acidity. The colors in the plot show the pattern clearly: light points at the top-left and dense points at the bottom-right.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I created a linear model using alcohol, sulphates and volatile.acidity. The model is built on the variables that have the closest relationship with quality, which makes the model more robust. The limitation is that none of these variables is strongly related to quality. As a result, the model explains only about 33% of the variance in quality.


Final Plots and Summary

Plot One

Description One

The histogram of alcohol (%) is right-skewed and does not have a very long tail, which makes it look more similar to the distribution of quality ratings.

Plot Two

Description Two

After applying jitter and the log transformation, we can see that alcohol appears to have a positive relationship with quality. volatile.acidity (g/dm^3) appears to have a negative relationship with quality.

Plot Three

Description Three

In this plot, we can clearly see that the relationship between volatile.acidity (g/dm^3) and quality is negative. Besides, from the color we can see that the light sulphates points (g/dm^3) sit mostly at the top-left of the plot while the dense sulphates points (g/dm^3) appear mostly at the bottom-right.


Reflection

Throughout the analysis, I was first blocked by my lack of background knowledge about red wine. All the chemical terms were unfamiliar to me. I solved this by Googling “red wine chemical compounds” and reading through some articles explaining the terms.

My second challenge was that after finishing the “Univariate” section, I didn’t have solid evidence of where to go next; I could only “guess” what might be interesting to explore based on the shapes of the histograms. This was solved once I started the bivariate section and used the cor() and ggpairs() functions to find out which variables have a strong relationship with quality.

Third, I reached a dead end: even though I applied the log transformation, the data still didn’t look good, and I wasn’t able to find anything interesting in the plot. The issue was solved after I went back to the course videos and applied jitter() to my plot. After adding this function, the plot looked better and I was able to see patterns in the data.

Besides, there might be other factors that affect the results of the analysis:
  • Country/Region: where each red wine is produced.
  • Storage: is the wine stored properly?
  • Year of production: this might affect the quality a lot.

In addition, since none of the variables is strongly correlated with quality, the model may be biased.